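Maximizing a monotone submodular function is a fundamental task in machine learning, economics, and statistics. In this paper, we present two communication-efficient decentralized online algorithms for the monotone continuous DR-submodular maximization problem, both of which reduce the number of per-function gradient evaluations and the per-round communication complexity from $T^{3/2}$ to $1$. The first, One-shot Decentralized Meta-Frank-Wolfe (Mono-DMFW), achieves a $(1-1/e)$-regret bound of $O(T^{4/5})$. As far as we know, this is the first one-shot and projection-free decentralized online algorithm for monotone continuous DR-submodular maximization. Next, inspired by the non-oblivious boosting function \citep{zhang2022boosting}, we propose the Decentralized Online Boosting Gradient Ascent (DOBGA) algorithm, which attains a $(1-1/e)$-regret of $O(\sqrt{T})$. To the best of our knowledge, this is the first result to obtain the optimal $O(\sqrt{T})$ bound on the $(1-1/e)$-regret. Finally, various experimental results confirm the effectiveness of the proposed methods.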
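For readers unfamiliar with the term, the $\alpha$-regret that these bounds refer to is usually defined as follows. This is the standard centralized form, given only as a reading aid: the symbols $\mathcal{K}$ (feasible set), $f_t$ (objective revealed in round $t$), and $\boldsymbol{x}_t$ (the point played in round $t$) are notation assumed here, not taken from the abstract, and the decentralized definition used by the authors may differ in detail.

```latex
\mathcal{R}_{\alpha}(T)
  = \alpha \max_{\boldsymbol{x} \in \mathcal{K}} \sum_{t=1}^{T} f_t(\boldsymbol{x})
  - \sum_{t=1}^{T} f_t(\boldsymbol{x}_t),
\qquad \alpha = 1 - \frac{1}{e}.
```

Under this definition, a bound of $O(\sqrt{T})$ means the gap to the $\alpha$-scaled best fixed decision in hindsight grows sublinearly in the horizon $T$.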
translated by Google Translate
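In this paper, we revisit the online non-monotone continuous DR-submodular maximization problem over a down-closed convex set, which finds wide real-world applications in machine learning, economics, and operations research. First, we present the Meta-MFW algorithm, which achieves a $1/e$-regret of $O(\sqrt{T})$ at the cost of $T^{3/2}$ stochastic gradient evaluations per round. As far as we know, Meta-MFW is the first algorithm to obtain a $1/e$-regret of $O(\sqrt{T})$ in this setting. Furthermore, in sharp contrast with the ODC algorithm \citep{thang2021online}, Meta-MFW relies on a simple online linear oracle without discretization, lifting, or rounding operations. Considering practical restrictions, we then propose the Mono-MFW algorithm, which reduces the per-function stochastic gradient evaluations from $T^{3/2}$ to $1$ and achieves a $1/e$-regret bound of $O(T^{4/5})$. Next, we extend Mono-MFW to the bandit setting and propose the Bandit-MFW algorithm, which attains a $1/e$-regret bound of $O(T^{8/9})$. To the best of our knowledge, Mono-MFW and Bandit-MFW are the first sublinear-regret algorithms to explore the one-shot and bandit settings, respectively, for online non-monotone continuous DR-submodular maximization over a down-closed convex set. Finally, we conduct numerical experiments on both synthetic and real-world datasets to verify the effectiveness of our methods.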
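In this paper, we revisit constrained and stochastic continuous submodular maximization in both offline and online settings. For each $\gamma$-weakly DR-submodular function $f$, we use a factor-revealing optimization equation to derive an optimal auxiliary function $F$, whose stationary points provide a $(1-e^{-\gamma})$-approximation to the global maximum value (denoted $OPT$) of the problem $\max_{\boldsymbol{x} \in \mathcal{C}} f(\boldsymbol{x})$. Naturally, projected (mirror) gradient ascent based on this non-oblivious function achieves $(1-e^{-\gamma}-\epsilon^{2})OPT-\epsilon$ after $O(1/\epsilon^{2})$ iterations, beating the traditional $(\frac{\gamma^{2}}{1+\gamma^{2}})$-approximation gradient ascent \citep{hassani2017gradient} for submodular maximization. Similarly, based on $F$, the classical Frank-Wolfe algorithm equipped with a variance reduction technique \citep{mokhtari2018conditional} also returns a solution with objective value larger than $(1-e^{-\gamma}-\epsilon^{2})OPT-\epsilon$ after $O(1/\epsilon^{3})$ iterations. In the online setting, we first consider adversarial delays in the stochastic gradient feedback, under which we propose a boosting online gradient algorithm with the same non-oblivious search, achieving a regret of $\sqrt{D}$ (where $D$ is the sum of the delays of the gradient feedback) against a $(1-e^{-\gamma})$-approximation to the best feasible solution in hindsight. Finally, extensive numerical experiments demonstrate the efficiency of our boosting methods.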
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
Masked image modeling (MIM) has shown great promise for self-supervised learning (SSL), yet it has been criticized for learning inefficiency. We believe that insufficient utilization of training signals is responsible. To alleviate this issue, we introduce a conceptually simple yet learning-efficient MIM training scheme, termed Disjoint Masking with Joint Distillation (DMJD). For disjoint masking (DM), we sequentially sample multiple masked views per image in a mini-batch under the disjoint regulation, raising the usage of tokens for reconstruction in each image while keeping the masking rate of each view. For joint distillation (JD), we adopt a dual-branch architecture to predict invisible (masked) and visible (unmasked) tokens, respectively, with superior learning targets. Rooted in orthogonal perspectives on improving training efficiency, DM and JD cooperatively accelerate training convergence without sacrificing the model's generalization ability. Concretely, DM can train ViT with half of the effective training epochs (3.7 times less time-consuming) and still report competitive performance. With JD, our DMJD clearly improves the linear probing classification accuracy over ConvMAE by 5.8%. On fine-grained downstream tasks such as semantic segmentation and object detection, DMJD also shows superior generalization compared with state-of-the-art SSL methods. The code and model will be made public at https://github.com/mx-mark/DMJD.
Recently, great progress has been made in single-image super-resolution (SISR) based on deep learning. However, existing methods usually incur a large computational cost. Meanwhile, the activation function causes some features of the intermediate layers to be lost. It is therefore a challenge to make the model lightweight while reducing the impact of intermediate feature loss on reconstruction quality. In this paper, we propose a Feature Interaction Weighted Hybrid Network (FIWHN) to alleviate this problem. Specifically, FIWHN uses a series of novel Wide-residual Distillation Interaction Blocks (WDIB) as its backbone, where every three WDIBs form a Feature shuffle Weighted Group (FSWG) through mutual information mixing and fusion. In addition, to mitigate the adverse effects of intermediate feature loss on the reconstruction results, we introduce well-designed Wide Convolutional Residual Weighting (WCRW) and Wide Identical Residual Weighting (WIRW) units in the WDIB, and effectively cross-fuse features of different finenesses through a Wide-residual Distillation Connection (WRDC) framework and a Self-Calibrating Fusion (SCF) unit. Finally, to complement the global features that the CNN model lacks, we introduce the Transformer into our model and explore a new way of combining the CNN and the Transformer. Extensive quantitative and qualitative experiments on low-level and high-level tasks show that the proposed FIWHN achieves a good balance between performance and efficiency, and is more conducive to solving downstream-task problems in low-pixel scenarios.
Rigorous guarantees about the performance of predictive algorithms are necessary in order to ensure their responsible use. Previous work has largely focused on bounding the expected loss of a predictor, but this is not sufficient in many risk-sensitive applications where the distribution of errors is important. In this work, we propose a flexible framework to produce a family of bounds on quantiles of the loss distribution incurred by a predictor. Our method takes advantage of the order statistics of the observed loss values rather than relying on the sample mean alone. We show that a quantile is an informative way of quantifying predictive performance, and that our framework applies to a variety of quantile-based metrics, each targeting important subsets of the data distribution. We analyze the theoretical properties of our proposed method and demonstrate its ability to rigorously control loss quantiles on several real-world datasets.
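As an illustration of how order statistics can bound a loss quantile, here is a minimal sketch of the classical distribution-free binomial argument. It is not the authors' framework: the function names `binom_cdf` and `quantile_upper_bound` are hypothetical, and the only assumption is that the losses are i.i.d. draws. The idea is that the $k$-th smallest observed loss exceeds the true $\beta$-quantile with probability at least $1-\delta$ whenever $P(\mathrm{Bin}(n, \beta) \le k-1) \ge 1-\delta$.

```python
import math


def binom_cdf(k, n, p):
    """P(Bin(n, p) <= k) by direct summation (adequate for small n)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def quantile_upper_bound(losses, beta, delta):
    """Upper confidence bound on the beta-quantile of the loss distribution.

    Returns the smallest order statistic X_(k) such that
    P(Bin(n, beta) <= k - 1) >= 1 - delta, or math.inf when the sample is
    too small to certify any finite bound at confidence level 1 - delta.
    """
    n = len(losses)
    ordered = sorted(losses)
    for k in range(1, n + 1):
        if binom_cdf(k - 1, n, beta) >= 1 - delta:
            return ordered[k - 1]
    return math.inf
```

With five observed losses `[3, 1, 2, 5, 4]`, `beta = 0.5`, and `delta = 0.05`, only the largest order statistic satisfies the condition, so the sketch returns `5`; with a single observation it returns `math.inf`, since one sample cannot certify a median bound at 95% confidence.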
Recently, large-scale pre-trained models have shown their advantages in many tasks. However, because of their huge computational complexity and storage requirements, it is challenging to apply such models to real scenes. A common solution is knowledge distillation, which treats the large-scale model as a teacher and uses it to train a small student model that achieves competitive performance. Cross-task knowledge distillation expands the application scenarios of large-scale pre-trained models. Existing knowledge distillation works focus on directly mimicking the final prediction or the intermediate layers of the teacher model, which represent global-level characteristics and are task-specific. To alleviate the constraint of different label spaces, capturing invariant intrinsic local object characteristics (such as the shape characteristics of the legs and tails of cattle and horses) plays a key role. Considering the complexity and variability of real scene tasks, we propose a Prototype-guided Cross-task Knowledge Distillation (ProC-KD) approach to transfer the intrinsic local-level object knowledge of a large-scale teacher network to various task scenarios. First, to better transfer the generalized knowledge in the teacher model in cross-task scenarios, we propose a prototype learning module that learns from the essential feature representation of objects in the teacher model. Second, for diverse downstream tasks, we propose a task-adaptive feature augmentation module that enhances the features of the student model with the learned generalization prototype features and guides the training of the student model to improve its generalization ability. Experimental results on various visual tasks demonstrate the effectiveness of our approach for cross-task knowledge distillation from large-scale models.
Crowd counting plays an important role in risk perception and early warning, traffic control, and scene statistical analysis. The challenges of crowd counting in highly dense and complex scenes lie in the mutual occlusion of human body parts, the large variation in body scales, and the complexity of imaging conditions. Deep-learning-based head detection is a promising method for crowd counting. However, widely used object detection networks cannot be applied well to this field for two main reasons. First, most existing head detection datasets are annotated only with center points instead of the bounding boxes that canonical detectors require. Second, sample imbalance has not yet been overcome in highly dense and complex scenes, because existing loss functions calculate the positive loss at a single key point or over the entire target area with the same weight. To address these problems, we propose a novel loss function, called Mask Focal Loss, to unify the loss functions based on heatmap ground truth (GT) and binary feature map GT. Mask Focal Loss redefines the weight of the loss contributions according to the situ value of the heatmap with a Gaussian kernel. For better evaluation and comparison, a new synthetic dataset, GTA\_Head, is made public, including 35 sequences, 5096 images, and 1732043 head labels with bounding boxes. Experimental results show strong performance and demonstrate that the proposed Mask Focal Loss is applicable to all of the canonical detectors and to various datasets with different GT. This provides a strong basis for surpassing crowd counting methods based on density estimation.
Market sentiment analysis on social media content requires knowledge of both financial markets and social media jargon, which makes it a challenging task for human raters. The resulting lack of high-quality labeled data stands in the way of conventional supervised learning methods. Instead, we approach this problem using semi-supervised learning with a large language model (LLM). Our pipeline generates weak financial sentiment labels for Reddit posts with an LLM and then uses that data to train a small model that can be served in production. We find that prompting the LLM to produce Chain-of-Thought summaries and forcing it through several reasoning paths helps generate more stable and accurate labels, while using a regression loss further improves distillation quality. With only a handful of prompts, the final model performs on par with existing supervised models. Though production applications of our model are limited by ethical considerations, the model's competitive performance points to the great potential of using LLMs for tasks that otherwise require skill-intensive annotation.
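To make the idea of combining several reasoning paths with a regression loss concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the abstract does not specify the label set, the aggregation rule, or the loss, so the three-way `LABEL_TO_SCORE` mapping, the averaging in `soft_label`, and the mean squared error in `mse_loss` are hypothetical choices rather than the authors' pipeline.

```python
# Hypothetical mapping from an LLM's textual verdict to a numeric score.
LABEL_TO_SCORE = {"bearish": -1.0, "neutral": 0.0, "bullish": 1.0}


def soft_label(path_labels):
    """Average the verdicts of several reasoning paths into one weak label.

    Averaging (rather than majority voting) keeps disagreement between
    paths as a graded target, which is what a regression loss can use.
    """
    scores = [LABEL_TO_SCORE[label] for label in path_labels]
    return sum(scores) / len(scores)


def mse_loss(predictions, targets):
    """Mean squared error between student predictions and soft labels."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared) / len(squared)
```

For example, four reasoning paths returning `["bullish", "bullish", "neutral", "bearish"]` yield a soft label of `0.25`, and a small student model would then be trained to regress toward that value instead of a hard class.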